Performance of Cross Validation in Tree-Based Models
Abstract
Cross Validation (CV) is widely used to measure the performance of a classifier. The main purpose of this study is to explore the behavior of CV in tree-based models. We report experimental studies that compare a cross-validated tree classifier with an oracle classifier, an idealized classifier derived from knowledge of the underlying distributions. Our main observation is that the difference between testing and training error for a cross-validated tree classifier is, empirically, linearly related to the same difference for an oracle classifier. The slope and R² of the fitted regression models are used as performance measures for the cross-validated tree classifier. Moreover, simulation reveals that the performance of a cross-validated tree classifier depends on the geometry, the parameters of the underlying distributions, and the sample size. These observations help explain and justify the behavior of CV in tree-based models.
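The two performance measures named in the abstract, the slope and R² of a linear regression between error gaps, can be computed as in the minimal sketch below. It makes several assumptions the abstract does not state: the function name `slope_and_r2` is hypothetical, the regression is taken to be ordinary least squares with the oracle gap as predictor and the cross-validated tree gap as response, and the numbers are invented purely for illustration (they are not results from the study).

```python
from statistics import mean


def slope_and_r2(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, R^2)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # For simple linear regression, R^2 equals the squared correlation.
    r2 = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r2


# Hypothetical gaps (testing error minus training error), one pair per
# simulated dataset. These values are made up for illustration only.
oracle_gap = [0.01, 0.02, 0.03, 0.05, 0.08]
cv_tree_gap = [0.03, 0.05, 0.08, 0.11, 0.18]

slope, intercept, r2 = slope_and_r2(oracle_gap, cv_tree_gap)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

Under these assumptions, an R² close to 1 would indicate that the cross-validated tree's gap is well described by a linear function of the oracle's gap, and the slope would indicate how strongly the two gaps scale together. How the study itself interprets particular slope values is not stated in the abstract.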
Similar resources
Use of classification tree methods to study the habitat requirements of tench (Tinca tinca) (L., 1758)
Classification trees (J48) were induced to predict the habitat requirements of tench (Tinca tinca). 306 datasets were used for the given fish during 8 years in the river basins in Flanders (Belgium). The input variables consisted of the structural-habitat (width, depth, gradient slope and distance from the source) and physico-chemical (pH, dissolved oxygen, water temperature and electric conduct...
Presenting a Model for Predicting Tax Evasion of Guilds Based on Data Mining Technique
In this research, considering the importance of the topic and the gap in previous research, a model for predicting tax evasion of guilds based on data mining technique is presented. The analyzed data includes the review of 5600 tax files of all trades with tax codes in Qazvin province during the years 2013-2018. The tax file related to guilds is in five tax groups, including the guild group o...
Real-time quality monitoring in debutanizer column with regression tree and ANFIS
A debutanizer column is an integral part of any petroleum refinery. Online composition monitoring of debutanizer column outlet streams is highly desirable in order to maximize the production of liquefied petroleum gas. In this article, data-driven models for debutanizer column are developed for real-time composition monitoring. The dataset used has seven process variables as inputs and the outp...
Identifying Student Behavior for Improving Online Course Performance with Machine Learning
In this study we investigate the correlation between student behavior and performance in online courses. Based on the web logs and syllabus of a course, we extract features that characterize student behavior. Using machine learning algorithms, we build models to predict performance at the end of the period. Furthermore, we identify important behavior and behavior combinations in the models. The res...
Statistical process control for validating a classification tree model for predicting mortality - A novel approach towards temporal validation
Prediction models are postulated as useful tools to support tasks such as clinical decision making and benchmarking. In particular, classification tree models have enjoyed much interest in the Biomedical Informatics literature. However, their prospective predictive performance over the course of time has not been investigated. In this paper we suggest and apply statistical process control metho...